DOI: 10.1145/584792.584829
Article

Topic-based document segmentation with probabilistic latent semantic analysis

Published: 04 November 2002

ABSTRACT

This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) model with the method of selecting segmentation points based on the similarity values between pairs of adjacent blocks. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. Furthermore, segmentation performance is improved by combining different instantiations of the same model, either using different random initializations or different numbers of latent classes. Results on commonly available data sets are significantly better than those of other state-of-the-art systems.
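
As a rough illustration of the idea described in the abstract, the sketch below computes similarity values between adjacent blocks, averages them over several model instantiations, and places boundaries at pronounced similarity minima. It is a minimal sketch under stated assumptions, not the paper's implementation: the per-block topic distributions are assumed to be already available (for example from a trained PLSA model), and the cosine measure, the averaging rule, the local-minimum criterion, the threshold value, and all function names are hypothetical choices made for this example.

```python
import numpy as np


def adjacent_similarities(block_vectors):
    """Cosine similarity between each pair of adjacent blocks.

    block_vectors: array of shape (n_blocks, n_topics); each row is assumed
    to be a topic distribution for one block.
    """
    v = np.asarray(block_vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1)
    dots = np.sum(v[:-1] * v[1:], axis=1)
    return dots / (norms[:-1] * norms[1:])


def combine_models(per_model_vectors):
    """Average adjacent-block similarities over several model instantiations.

    Averaging is an assumption of this sketch; the abstract only says that
    different instantiations are combined.
    """
    sims = [adjacent_similarities(v) for v in per_model_vectors]
    return np.mean(sims, axis=0)


def pick_boundaries(sims, threshold):
    """Return gap indices i (between block i and block i + 1) that are local
    minima of the similarity curve and fall below the threshold."""
    boundaries = []
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else np.inf
        right = sims[i + 1] if i + 1 < len(sims) else np.inf
        if s <= left and s <= right and s < threshold:
            boundaries.append(i)
    return boundaries


# Hypothetical topic distributions for four blocks from two model runs.
model_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
model_b = [[0.7, 0.3], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

combined = combine_models([model_a, model_b])
boundaries = pick_boundaries(combined, threshold=0.5)
print(boundaries)
```

In this toy input the topic mix changes between the second and third block, so the only gap selected is the one at index 1. Averaging over runs is meant to damp the effect of any single random initialization on the similarity curve.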

Published in

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
November 2002
704 pages
ISBN: 1581134924
DOI: 10.1145/584792

Copyright © 2002 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

Article

Acceptance Rates

Overall acceptance rate: 1,861 of 8,427 submissions (22%)
